UQGAN: A Unified Model for Uncertainty Quantification of Deep Classifiers trained via Conditional GANs

Neural Information Processing Systems

We present an approach to quantifying both aleatoric and epistemic uncertainty for deep neural networks in image classification, based on generative adversarial networks (GANs). While most works in the literature that use GANs to generate out-of-distribution (OoD) examples focus only on the evaluation of OoD detection, we present a GAN-based approach to learn a classifier that produces proper uncertainties for OoD examples as well as for false positives (FPs). Instead of shielding the entire in-distribution data with GAN-generated OoD examples, which is the current state of the art, we shield each class separately with out-of-class examples generated by a conditional GAN and complement this with a one-vs-all image classifier. In our experiments, in particular on CIFAR10, CIFAR100 and Tiny ImageNet, we improve over the OoD detection and FP detection performance of state-of-the-art GAN-training-based classifiers. Furthermore, we find that the generated GAN examples do not significantly affect the calibration error of our classifier and result in a significant gain in model accuracy.
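The one-vs-all setup described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the loss form, the uncertainty heuristics, and all function names are assumptions; per-class "out-of-class" negatives would in practice come from the conditional GAN.

```python
import numpy as np

def ova_bce_loss(scores, in_class_mask):
    """Binary cross-entropy over one-vs-all heads.

    scores: (N, K) sigmoid outputs, one head per class.
    in_class_mask: (N, K) with 1 where the sample is a true in-class
    example for that head, 0 for out-of-class / GAN-generated negatives.
    """
    eps = 1e-7
    scores = np.clip(scores, eps, 1 - eps)
    return -np.mean(in_class_mask * np.log(scores)
                    + (1 - in_class_mask) * np.log(1 - scores))

def uncertainty_scores(scores):
    """Heuristic split of the two uncertainty types (illustrative only):
    if no head claims the sample, treat it as epistemic (likely OoD);
    if two heads compete, treat it as aleatoric (class ambiguity)."""
    best = scores.max(axis=1)
    epistemic = 1.0 - best                      # no head is confident
    second = np.sort(scores, axis=1)[:, -2]     # runner-up head
    aleatoric = second / (best + 1e-7)          # ambiguity between top heads
    return aleatoric, epistemic
```

A confidently claimed sample (one head near 1) yields low epistemic uncertainty, while near-uniform low scores across all heads, as the GAN-shielded training encourages for OoD inputs, yield high epistemic uncertainty.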


FaceComposer: A Unified Model for Versatile Facial Content Creation

Neural Information Processing Systems

This work presents FaceComposer, a unified generative model that accomplishes a variety of facial content creation tasks, including text-conditioned face synthesis, text-guided face editing, and face animation. Based on the latent diffusion framework, FaceComposer follows the paradigm of compositional generation and employs diverse face-specific conditions, e.g., Identity Feature and Projected Normalized Coordinate Code, to unleash the model's creativity as much as possible. To support text control and animation, we clean up some existing face image datasets and collect around 500 hours of talking-face videos, forming a high-quality large-scale multi-modal face database. A temporal self-attention module is incorporated into the U-Net structure, which allows learning the denoising process on a mixture of images and videos. Extensive experiments suggest that our approach not only achieves comparable or even better performance than the state of the art on each single task, but also facilitates combined tasks in a single forward pass, demonstrating its potential to serve as a foundation generative model in the face domain. We further develop an interface such that users can enjoy our one-step service to create, edit, and animate their own characters. Code, dataset, model, and interface will be made publicly available.
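The temporal self-attention idea, attention applied only along the frame axis so the same weights train on both still images (one frame) and videos, can be sketched as below. This is a single-head numpy illustration under assumed shapes, not FaceComposer's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x, wq, wk, wv):
    """Single-head self-attention along the time axis only.

    x: (B, T, C) features for T frames at one spatial location.
    For a still image T == 1, so attention collapses to a weight of 1
    and the block reduces to a residual linear map, letting images and
    videos share one denoising model.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]), axis=-1)
    return x + attn @ v   # residual connection, as in U-Net attention blocks
```

In a real latent-diffusion U-Net this would sit after the spatial attention layers, with the spatial dimensions folded into the batch axis before attending over time.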



Uni-Hema: Unified Model for Digital Hematopathology

Rehman, Abdul, Rasool, Iqra, Imran, Ayisha, Ali, Mohsen, Sultani, Waqas

arXiv.org Artificial Intelligence

Digital hematopathology requires cell-level analysis across diverse disease categories, including malignant disorders (e.g., leukemia), infectious conditions (e.g., malaria), and non-malignant red blood cell disorders (e.g., sickle cell disease). Whether single-task, vision-language, WSI-optimized, or single-cell hematology models, these approaches share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. Uni-Hema leverages 46 publicly available datasets, encompassing over 700K images and 21K question-answer pairs, and is built upon Hema-Former, a multimodal module that bridges visual and linguistic representations hierarchically across the different tasks (detection, classification, segmentation, morphology, masked language modeling and visual question answering) at different granularities. Extensive experiments demonstrate that Uni-Hema achieves comparable or superior performance to single-task, single-dataset models across diverse hematological tasks, while providing interpretable, morphologically relevant insights at the single-cell level. Our framework establishes a new standard for multi-task and multi-modal digital hematopathology. The code will be made publicly available.


AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

Chen, Tuochao, Veluri, Bandhav, Gong, Hongyu, Gollakota, Shyamnath

arXiv.org Artificial Intelligence

Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialogue framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection, and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.


A Unified Model for Human Mobility Generation in Natural Disasters

Long, Qingyue, Wang, Huandong, Wang, Qi Ryan, Li, Yong

arXiv.org Artificial Intelligence

Human mobility generation in disaster scenarios plays a vital role in resource allocation, emergency response, and rescue coordination. During disasters such as wildfires and hurricanes, human mobility patterns often deviate from their normal states, which makes the task more challenging. However, existing works usually rely on limited data from a single city or specific disaster, significantly restricting the model's generalization capability in new scenarios. In fact, disasters are highly sudden and unpredictable, and any city may encounter new types of disasters without prior experience. Therefore, we aim to develop a one-for-all model for mobility generation that can generalize to new disaster scenarios. However, building a universal framework faces two key challenges: 1) the diversity of disaster types and 2) the heterogeneity among different cities. In this work, we propose a unified model for human mobility generation in natural disasters (named UniDisMob). To enable cross-disaster generalization, we design physics-informed prompt and physics-guided alignment that leverage the underlying common patterns in mobility changes after different disasters to guide the generation process. To achieve cross-city generalization, we introduce a meta-learning framework that extracts universal patterns across multiple cities through shared parameters and captures city-specific features via private parameters. Extensive experiments across multiple cities and disaster scenarios demonstrate that our method significantly outperforms state-of-the-art baselines, achieving an average performance improvement exceeding 13%.
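The shared/private parameter split that the abstract describes for cross-city generalization can be sketched in a few lines. This is an illustrative linear toy under assumed dimensions, not UniDisMob's architecture; the class and method names are hypothetical.

```python
import numpy as np

class CityMobilityModel:
    """Toy sketch of the UniDisMob parameter split: one weight matrix
    shared across all cities (universal mobility patterns) plus a small
    private parameter vector per city (city-specific features)."""

    def __init__(self, dim, rng):
        self.shared = rng.normal(scale=0.1, size=(dim, dim))  # universal patterns
        self.private = {}                                     # per-city parameters

    def forward(self, city, x):
        # A previously unseen city starts from the shared parameters alone,
        # mirroring the meta-learning goal of generalizing to new scenarios.
        if city not in self.private:
            self.private[city] = np.zeros(x.shape[-1])
        return x @ self.shared + self.private[city]
```

In a meta-learning setup, `shared` would be updated on data from all cities while each `private[city]` is adapted only on that city's data.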


From Masks to Worlds: A Hitchhiker's Guide to World Models

Bai, Jinbin, Lei, Yu, Wu, Hecong, Zhu, Yuchen, Li, Shufan, Xin, Yi, Li, Xiangtai, Tao, Molei, Grover, Aditya, Yang, Ming-Hsuan

arXiv.org Artificial Intelligence

This is not a typical survey of world models; it is a guide for those who want to build worlds. We do not aim to catalog every paper that has ever mentioned a "world model". Instead, we follow one clear road: from early masked models that unified representation learning across modalities, to unified architectures that share a single paradigm, then to interactive generative models that close the action-perception loop, and finally to memory-augmented systems that sustain consistent worlds over time. We bypass loosely related branches to focus on the core: the generative heart, the interactive loop, and the memory system. We show that this is the most promising path towards true world models. The term world model has been used to describe many different ideas: learned environment simulators for reinforcement learning (Ha & Schmidhuber, 2018; Hafner et al., 2019), agents that integrate learned models with planning (Schrittwieser et al., 2020), and large language models that simulate entire societies (Park et al., 2023). Yet despite hundreds of related works, there is no clear consensus on how to actually build a true world model. In this paper, we take a stance: the path is much narrower than it appears. A true world model is not a monolithic entity but a system synthesized from three core subsystems: a generative heart that produces world states, an interactive loop that closes the action-perception cycle in real time, and a persistent memory system that sustains coherence over long horizons. The history of the field can be understood as an evolutionary journey from first mastering these components in isolation to now integrating them. Most works focus on optimizing narrow tasks and drift away from the generative, interactive, and persistent nature required for a true world model. To make this perspective concrete, we chart the historical evolution of world models as a sequence of five stages, shown in Figure 1.
It begins with Stage I: Mask-based Models, which established a universal, token-based pretraining paradigm across modalities.


UniRL-Zero: Reinforcement Learning on Unified Models with Joint Language Model and Diffusion Model Experts

Wang, Fu-Yun, Zhang, Han, Gharbi, Michael, Li, Hongsheng, Park, Taesung

arXiv.org Artificial Intelligence

We present UniRL-Zero, a unified reinforcement learning (RL) framework that boosts multimodal language model understanding and reasoning, diffusion model multimedia generation, and their beneficial interaction within a unified model. Our work defines six scenarios for unified-model reinforcement learning, providing systematic baselines for reinforcement learning of unified understanding and generation models. Our code is available at https://github.com/G-UN/UniRL.